Section: New Results

Big Data Integration

Probabilistic Data Integration

Participants : Reza Akbarinia, Naser Ayat, Patrick Valduriez.

Data uncertainty in scientific applications can have many different causes: incomplete knowledge of the underlying system, inexact model parameters, inaccurate representation of initial boundary conditions, inaccurate equipment, data entry errors, etc.

An important problem that arises in big data integration is that of Entity Resolution (ER). ER is the process of identifying tuples that represent the same real-world entity. The problem of entity resolution over probabilistic data (which we call ERPD) arises in many distributed application domains that have to deal with probabilistic data, ranging from sensor databases to scientific data management. The ERPD problem can be formally defined as follows. Let e be an uncertain entity represented by multiple possible alternatives, i.e., tuples, each with a membership probability. Let D be an uncertain database composed of a set of tuples, each associated with a membership probability. Then, given e, D, and a similarity function F, the problem is to find the entity-tuple pair (t, t_i) (where t ∈ e and t_i ∈ D) such that (t, t_i) has the highest cumulative probability of being the most similar pair over all possible worlds. This entity-tuple pair is called the most probable match pair of e and D, denoted by MPMP(e, D).
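To make the definition concrete, the following Python sketch computes MPMP(e, D) by exhaustively enumerating possible worlds. It is only an illustration of the definition, not the algorithm of [24]: it assumes that the alternatives of e are mutually exclusive with probabilities summing to 1, that the tuples of D are independent, and that ties are broken in favor of the first tuple. The function name `mpmp_brute_force` and the data layout are hypothetical, and the enumeration is exponential in the size of D.

```python
from itertools import product


def mpmp_brute_force(entity, database, sim):
    """Return ((alt_index, tuple_index), probability) for the most probable match pair.

    entity:   list of (tuple, probability) alternatives (assumed mutually exclusive)
    database: list of (tuple, probability) tuples (assumed independent)
    sim:      context-free similarity function sim(t, t_i) -> float
    """
    scores = {}
    for alt_idx, (t, p_t) in enumerate(entity):
        # Each mask selects which database tuples exist in one possible world.
        for mask in product([False, True], repeat=len(database)):
            p_world = p_t
            for present, (_, p_i) in zip(mask, database):
                p_world *= p_i if present else (1.0 - p_i)
            present_idx = [i for i, m in enumerate(mask) if m]
            if not present_idx or p_world == 0.0:
                continue  # no candidate pair in this world
            # Most similar pair in this world (first index wins on ties).
            best = max(present_idx, key=lambda i: sim(t, database[i][0]))
            key = (alt_idx, best)
            scores[key] = scores.get(key, 0.0) + p_world
    if not scores:
        return None, 0.0
    return max(scores.items(), key=lambda kv: kv[1])


# Toy data: one-attribute tuples, similarity = negative absolute difference.
entity = [((1.0,), 0.6), ((5.0,), 0.4)]
database = [((1.2,), 0.5), ((4.0,), 0.9)]
pair, prob = mpmp_brute_force(entity, database, lambda a, b: -abs(a[0] - b[0]))
print(pair, round(prob, 2))
```

On this toy input, the pair (1, 1), i.e., the second alternative of e matched with the second tuple of D, accumulates a probability of about 0.36 and is therefore the most probable match pair, even though the first alternative of e is individually more likely.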

Many real-life applications produce uncertain data distributed among a number of databases. Dealing with the ERPD problem for distributed data is quite important for such applications. A straightforward approach for answering distributed ERPD queries is to ask all distributed nodes to send their databases to a central node that deals with the problem of ER by using one of the existing centralized solutions. However, this approach is very expensive and scales well neither in the size of the databases nor in the number of nodes.

In [24], we propose an efficient solution for the ERPD problem. Our contributions are summarized as follows. We adapted the possible worlds semantics of probabilistic data to define the problem of ERPD based on both the similarity and the probability of tuples. We proposed a PTIME algorithm for the ERPD problem. This algorithm is applicable to a large class of similarity functions, namely context-free functions, where the similarity score of two tuples depends only on their attributes. For the remaining similarity functions (i.e., context-sensitive ones), we proposed a Monte Carlo approximation algorithm. We also proposed a parallel version of our Monte Carlo algorithm using the MapReduce framework. We conducted an extensive experimental study to evaluate our approach for ERPD over both real and synthetic datasets. The results show the effectiveness of our algorithms.
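The idea of a Monte Carlo approximation can be illustrated with a generic sampling estimator: draw random possible worlds, record which pair is the most similar in each, and report the pair that wins most often. The sketch below is not the algorithm of [24]; it assumes mutually exclusive alternatives for e with probabilities summing to 1, independent tuples in D, and a hypothetical similarity signature `sim(t, t_i, world)` that receives the sampled world as context. All names and parameters are illustrative.

```python
import random


def mpmp_monte_carlo(entity, database, sim, n_samples=10000, seed=0):
    """Estimate the most probable match pair by sampling possible worlds.

    entity:   list of (tuple, probability) alternatives (assumed mutually exclusive)
    database: list of (tuple, probability) tuples (assumed independent)
    sim:      similarity function sim(t, t_i, world) -> float, where `world`
              is the list of database tuples present in the sampled world
    Returns ((alt_index, tuple_index), estimated probability).
    """
    rng = random.Random(seed)
    weights = [p for _, p in entity]
    counts = {}
    for _ in range(n_samples):
        # Sample one alternative of the entity and one subset of the database.
        alt_idx = rng.choices(range(len(entity)), weights=weights)[0]
        present = [i for i, (_, p) in enumerate(database) if rng.random() < p]
        if not present:
            continue  # empty world: no pair to count
        t = entity[alt_idx][0]
        world = [database[i][0] for i in present]
        best = max(present, key=lambda i: sim(t, database[i][0], world))
        counts[(alt_idx, best)] = counts.get((alt_idx, best), 0) + 1
    if not counts:
        return None, 0.0
    pair, hits = max(counts.items(), key=lambda kv: kv[1])
    return pair, hits / n_samples
```

Because the sampled worlds are independent of one another, the counting loop could be split across workers and the per-pair counts summed afterwards, which is the general shape of a map and reduce decomposition.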

Another topic of interest is the integration of large astronomy data catalogs. The main challenge in such integration, besides the huge amount of catalog data to be merged, is the weak identification of sky objects, which leads to ambiguities in object matching among catalogs. In [30], we present the NACluster algorithm. NACluster uses a Euclidean metric space and distance function to drive disambiguation among objects in various catalogs, and extends the traditional k-means algorithm to deal with the dynamic creation of new clusters, each representing a real sky object. NACluster achieves F-measure results consistently superior to the matching results of the Q3C join operator, its closest competitor.
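As a rough illustration of what "k-means with dynamic creation of new clusters" can look like, the sketch below assigns each point to its nearest centroid under the Euclidean distance and opens a new cluster whenever no centroid lies within a threshold. This is not NACluster itself: the threshold `eps`, the fixed number of passes `n_iter`, and the function name are assumptions made for this example only.

```python
import math


def cluster_with_dynamic_creation(points, eps, n_iter=5):
    """Threshold-based k-means variant that creates clusters on demand.

    points: list of coordinate tuples (e.g., object positions from several catalogs)
    eps:    maximum Euclidean distance for a point to join an existing cluster
    Returns (labels, centroids).
    """
    centroids = []
    labels = [None] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid, or a new cluster if none is close enough.
        for idx, p in enumerate(points):
            best, best_d = None, float("inf")
            for c_idx, c in enumerate(centroids):
                d = math.dist(p, c)
                if d < best_d:
                    best, best_d = c_idx, d
            if best is None or best_d > eps:
                centroids.append(tuple(p))
                best = len(centroids) - 1
            labels[idx] = best
        # Update step: move each non-empty cluster's centroid to the mean of its members.
        for c_idx in range(len(centroids)):
            members = [points[i] for i, lab in enumerate(labels) if lab == c_idx]
            if members:
                centroids[c_idx] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Two well-separated groups of observations of (presumably) two sky objects.
observations = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
labels, centroids = cluster_with_dynamic_creation(observations, eps=1.0)
print(labels, len(centroids))
```

Unlike standard k-means, the number of clusters is not fixed in advance; it grows with the data, which matches the setting where the number of real sky objects is unknown before matching.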

CloudMdsQL, a query language for heterogeneous data stores

Participants : Carlyna Bondiombouy, Boyan Kolev, Oleksandra Levchenko, Patrick Valduriez.

The proliferation of cloud data management infrastructures, each specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. The CoherentPaaS European project addresses this problem by providing a common programming language and holistic coherence across different cloud data stores.

In this context, we have started the design of a Cloud Multi-datastore Query Language (CloudMdsQL) and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. Thus, CloudMdsQL unifies a quite diverse set of data management technologies while preserving the expressivity of their local query languages. Our experimental validation, with three data stores (graph, document and relational) and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatabase query language.

Semantic Data Integration using Bio-Ontologies

Participants : Emmanuel Castanier, Patrick Valduriez.

Biologists have adopted ontologies for several reasons: (1) to provide a canonical representation of scientific knowledge; (2) to annotate experimental data to enable interpretation, comparison, and discovery across databases; (3) to facilitate knowledge-based applications for decision support, natural language processing and data integration. The challenge is to automatically process complex databases and generate mappings using relevant ontologies in a way that scales up to many resources and ontologies, while being easy to use for the biomedical community, customizable to fit specific needs, and smart enough to leverage the knowledge contained in ontologies.

The National Center for Biomedical Ontology (NCBO) has developed a popular ontology-based annotation workflow. To address the above challenge, we have integrated the NCBO annotator with our WebSmatch tool and the Biosemantic tool from IRD to perform semantic annotation using bio-ontologies [47]. The resulting tool provides very useful capabilities. First, it can convert SQL database schemas to RDF/RDFS with Biosemantic. Second, it can annotate with the NCBO annotator and WebSmatch using the NCBO resources index. Third, the NCBO annotator relies on WebSmatch to create mappings between schema elements and ontological concepts, and uses ontology properties (i.e., subsumption, transitivity) to enhance matching techniques.

Unlike the biomedical domain, which has accepted ontologies as a means to manage (integrate) knowledge, the agronomic sciences have yet to exploit the full potential of ontologies. To this end, we are currently developing an RDF knowledge base, Agronomic Linked Data (AgroLD) [50]. The knowledge base is designed to integrate data from various publicly available plant-centric data sources. The aim of the AgroLD project is to collaborate with domain experts in bridging the gap between technology and its potential users to enhance biological research.